Class Announcements

Due Today:

  • A1

Due Friday:

  • D2

Due Monday:

  • Q3

Notes:

  • Repo invites: accept them ASAP; they become invalid after 7 days

Data Visualization

  • tools:
    • seaborn - generating plots
    • pandas - wrangling data
    • matplotlib - fine-tuning plots
  • plotting
    • quantitative data
    • categorical data
  • customizing visualizations


For more information on this topic, check out: (1) Jake VanderPlas' Python Data Science Handbook and (2) Berkeley's Data 100 Textbook.

A good data visualization can help you:

  • identify anomalies in your data
  • better understand your own data
  • communicate your findings

Quick Introduction

95%+ of plots fall into just a few types:

  • single variable
    • continuous
    • discrete
  • discrete vs discrete
  • discrete vs continuous
  • continuous vs continuous

Basic Visualizations

  • histograms
  • densityplots
  • scatterplot
  • barplot
    • grouped barplot
    • stacked barplot
  • boxplot (and related things like violinplots, etc)
  • line plot

Variable types : Plots

  • statistical/distribution of quantitative variable
    • single variable
      • histogram
      • densityplot
    • single variable x categorical variable
      • boxplot
  • count data
    • count data x categorical variable
      • barplot
    • count data x 2 categorical variables
      • grouped bar plot
      • stacked bar plot
  • Directly view quantitative variables
    • one variable x time
      • line plot
    • one variable x time x categorical variable
      • multiple lines on the same plot
    • two (or maybe 3) quantitative variables
      • scatter plot
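
The outline above can be read as a lookup from the kinds of variables you have to a reasonable default plot. Here is a minimal sketch of that idea; `PLOT_CHOICES` and `suggest_plot` are hypothetical names made up for illustration (they are not part of seaborn, pandas, or matplotlib).

```python
# Hypothetical helper: the outline above written as a lookup table.
PLOT_CHOICES = {
    ("quantitative",): ["histogram", "densityplot"],
    ("quantitative", "categorical"): ["boxplot"],
    ("count", "categorical"): ["barplot"],
    ("count", "categorical", "categorical"): ["grouped barplot", "stacked barplot"],
    ("quantitative", "time"): ["line plot"],
    ("quantitative", "time", "categorical"): ["multiple lines on the same plot"],
    ("quantitative", "quantitative"): ["scatterplot"],
}


def suggest_plot(*variable_types):
    """Return the default plot types listed for a combination of variable types."""
    key = tuple(variable_types)
    if key not in PLOT_CHOICES:
        raise ValueError(f"no default plot listed for {key}")
    return PLOT_CHOICES[key]


print(suggest_plot("count", "categorical"))
print(suggest_plot("quantitative", "time"))
```

Treat the result as a starting point: the right plot still depends on the question you're asking of the data.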

Source: Storytelling with Data (Nussbaumer Knaflic)


https://forms.gle/Dn1k7uHoQSwoVoHS7

Clicker Question #1

You want to visualize how many people in your dataset prefer chocolate chip cookies and how many prefer oatmeal raisin cookies.

What type of visualization would be most appropriate?

  • A) histogram
  • B) scatterplot
  • C) barplot
  • D) boxplot
  • E) line plot

Clicker Question #2

You're interested in visualizing how many servings of milk an individual drinks each day among those who prefer chocolate chip cookies and those who prefer oatmeal raisin cookies.

What type of visualization would be most appropriate?

  • A) histogram
  • B) scatterplot
  • C) barplot
  • D) boxplot
  • E) line plot

Clicker Question #3

You're interested in visualizing how many servings of milk an individual drinks each year over the course of their life.

What type of visualization would be most appropriate?

  • A) histogram
  • B) scatterplot
  • C) barplot
  • D) boxplot
  • E) line plot

Plotting in Python: Getting Started

First we'll import the libraries we'll use for plotting.

# import libraries for working with data
import pandas as pd
import numpy as np

# import seaborn
import seaborn as sns

# import matplotlib
import matplotlib.pyplot as plt # Typical way of importing MPL
import matplotlib as mpl # This line is used less frequently

#improve resolution
#comment this line if erroring on your machine/screen
%config InlineBackend.figure_format ='retina'
sns.__version__

seaborn

seaborn is a great place to get started when generating plots that don't look awful.

Class Data

With the libraries we need imported, the first dataset we'll use today is data from the COGS 108 class survey from the Spring of 2019.

df = pd.read_csv('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/data/df_for_viz.csv')
df.shape
df.head()

Wrangling that's been done:

  • removed lots of identifying information
  • standardized gender & job
  • separated out programming responses
df.describe()
df['statistics'].hist();

Quantitative Variables

  • histograms
  • densityplots
  • scatterplots

Histograms and Densityplots

Histograms & Densityplots are helpful for visualizing information about a single quantitative variable.

We can use seaborn's histplot function. (distplot in older versions of seaborn)

# set plotting size parameter
plt.rcParams['figure.figsize'] = (16,10) #default plot size to output
sns.set_theme(context='notebook',
              style='white',
              font_scale=3)
              #rc={'axes.spines.right': False,'axes.spines.top': False} )
# histogram
#`distplot` in older versions of `seaborn`
sns.histplot(df['statistics'], bins=10, kde=False);

One thing to note about histograms is that the number of bins displayed plays a large role in what the viewer takes away from the visualization.
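
To see how much the bin count matters without drawing anything, here is a small sketch that bins the same values three different ways with `np.histogram`. The `ratings` values are made up for illustration; they are not the class survey data.

```python
import numpy as np

# Made-up ratings on a 1-10 scale (not the class survey data)
ratings = np.array([1, 2, 2, 3, 5, 5, 5, 6, 6, 7, 7, 7, 7, 8, 8, 9, 10])

# bin the same data with different numbers of bins
counts_by_bins = {}
for n_bins in (2, 5, 10):
    counts, edges = np.histogram(ratings, bins=n_bins)
    counts_by_bins[n_bins] = counts
    print(n_bins, "bins:", counts.tolist())
```

Every row of output describes the same 17 values, but the apparent shape of the distribution changes with the number of bins.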

# `distplot` in older versions of `seaborn`
# just histogram - set kde = False
sns.histplot(df['statistics'], bins=10);

# Alternative approach using pandas
# df['statistics'].hist(bins=10);

This doesn't follow "visualization best practices."

Visualization Best Practices

  • Choose the right type of visualization
  • Be mindful when choosing colors
  • Label your axes
  • Make text big enough
  • Keep it simple
  • Less is more:
    • Aim to improve your data-ink ratio
    • Everything on the page should serve a purpose. If it doesn't, remove it.
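
As a rough sketch of how several of these practices translate into code, the example below uses plain matplotlib. The cookie counts are made-up numbers for illustration, and the font sizes are arbitrary choices, not recommendations from the source.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up counts for illustration
labels = ["chocolate chip", "oatmeal raisin"]
counts = [42, 17]

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(labels, counts, color="#686868")  # a single neutral color

# label the axes and make the text big enough
ax.set_title("Most respondents prefer chocolate chip", fontsize=18)
ax.set_xlabel("Cookie preference", fontsize=14)
ax.set_ylabel("Count", fontsize=14)
ax.tick_params(labelsize=12)

# less is more: drop the top and right borders and any gridlines
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.grid(False)
```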

Best Practices: Example

Source: Storytelling with Data (Nussbaumer Knaflic)

Ideas:

  • Pros:
    • consistent colors from left to right
    • values provided for each slice
    • overall picture
  • Cons:
    • text size
    • legend not ideal
    • colors are not intuitive
    • pie chart not ideal b/c of # of categories

Suggestions:

  • different visualization - stacked barplot?

Clicker Question #4

Consider some positive and some negative aspects of this visualization. Click in when you have finished thinking.

  • A) I have some ideas!
  • B) I've got no ideas.
  • C) I'm not sure what I'm supposed to be thinking about.

Source: Storytelling with Data (Nussbaumer Knaflic)


Less is more

The less is more approach suggests that we should probably get rid of this background color now and remove the gridlines. We'll use the less is more approach as we work through the other types of visualizations.

Let's improve that now for our original plot...

# `distplot` in older versions of `seaborn`
# change color to dark grey
ax = sns.histplot(df['statistics'],  
                  bins=10, color='darkgrey')
# remove the top and right lines
sns.despine()

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS108 students are moderately comfortable with statistics')
ax.set_ylabel('Count')
ax.set_xlabel('How comfortable are you with statistics?');
# kdeplot to only display the densityplot
ax = sns.kdeplot(df['programming'], color='#686868')

# remove the top and right lines
sns.despine()

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS108 students are pretty comfortable with programming')
ax.set_ylabel('Density')  # a KDE's y-axis is a density, not a count
ax.set_xlabel('How comfortable are you with programming?');
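
One difference between the two plot types is the y-axis: a histogram counts rows, while a density plot is scaled so that the total area under the curve is 1. The sketch below illustrates that scaling with a normalized histogram as a stand-in for a density curve (it is not a kernel density estimate), using made-up ratings.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(1, 11, size=200)  # made-up 1-10 ratings

# raw counts per bin vs. the same bins rescaled to a density
counts, edges = np.histogram(values, bins=10)
density, _ = np.histogram(values, bins=10, density=True)
widths = np.diff(edges)

print("counts sum to the number of rows:", counts.sum())
print("density * bin width sums to:", (density * widths).sum())
```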

Scatterplots

Scatterplots can help visualize the relationship between two quantitative variables.

sns.scatterplot(x='programming', y='statistics', data=df,
                #alpha=0.1 # comment this in and out
               );

# alternative with pandas
# df.plot.scatter('programming', 'statistics');
# jitter points to see relationship, try different levels of it
sns.lmplot(x='programming', y='statistics', data=df,
           fit_reg=False, height=6, aspect=2,
          x_jitter=.15, y_jitter=.15);
# fit a linear model, showing the line of best fit 
# and also 95% confidence interval on the fit
sns.lmplot(x='programming', y='statistics', data=df,
           fit_reg=True, height=6, aspect=2,
          x_jitter=.20, y_jitter=.20);
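
Why jitter? The survey responses are whole numbers, so many students land on exactly the same (x, y) point and hide each other. Jitter nudges each plotted point by a small random amount for display only. Here is a rough sketch of the idea with numpy, assuming uniform noise of at most `jitter` in each direction; the data are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up integer responses: several students share the same (x, y) pair
x = np.array([3, 3, 3, 4, 4, 5])
y = np.array([2, 2, 2, 4, 4, 5])

# shift every point by a small uniform random amount
jitter = 0.15
x_jittered = x + rng.uniform(-jitter, jitter, size=x.shape)
y_jittered = y + rng.uniform(-jitter, jitter, size=y.shape)

print(len(set(zip(x.tolist(), y.tolist()))), "distinct points before jitter")
print(len(set(zip(x_jittered.tolist(), y_jittered.tolist()))), "distinct points after jitter")
```

Larger jitter values separate the points more but also blur the true positions, which is why it's worth trying a few levels.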

Clicker Question #5

What can we say about the relationship between students' comfortability with programming and statistics?

  • A) Students who are more comfortable programming are more comfortable with statistics
  • B) Students who are more comfortable programming are less comfortable with statistics
  • C) There is little relationship between students' comfort level with programming and statistics

Scatterplots (by a categorical variable)

When you want to plot two numeric variables but want to get some insight about a third categorical variable, you can color the points on the plot by the categorical variable.

COLOR !

By the way, you REALLY need to read this... over and over again: https://seaborn.pydata.org/tutorial/color_palettes.html

# control color palette
unique = pd.concat([df["lecture_attendance"], df["gender"]]).unique()
my_palette = dict(zip(unique, sns.color_palette()))
my_palette.update({"Total":"k"})
print(my_palette)
# color points by gender
sns.lmplot(x='programming', y='statistics', data=df, hue='gender',
           fit_reg=True, height=6, aspect=2, 
           x_jitter=.5, y_jitter=.5,
           palette=my_palette);
sns.lmplot(x='programming', y='statistics', data=df, hue='lecture_attendance',
           fit_reg=True, height=6, aspect=2, 
           x_jitter=.5, y_jitter=.5,
           palette=my_palette);

Clicker Question #6

What can we say about the relationship between students' comfortability with programming and statistics and gender? And, how easy is this to determine?

  • A) Females and Other/Prefer not to say tend to be more comfortable with programming; easy to determine
  • B) Females and Other/Prefer not to say tend to be more comfortable with programming; difficult to determine
  • C) Males tend to be more comfortable with programming; easy to determine
  • D) Males tend to be more comfortable with programming; difficult to determine
  • E) I'm super lost.

We don't get a ton more information here, but we may see a slight shift in programming comfortability toward males relative to females. To better understand this, a boxplot would be helpful. (We'll look at this shortly.)

Categorical Variables

  • barplots
  • grouped barplots
  • stacked barplots

Barplots

In seaborn there are two types of bar charts:

  1. countplot - counts the number of times each category appears in a column
  2. barplot - groups the dataframe by a categorical column and sets the height of each bar to the average of a numerical column within each group (This is usually not the right way to visualize quantitative data, so we're not covering it in this class.)
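
To make the difference concrete, here is a small pandas sketch of the two summaries: the number of rows per category (what a countplot draws) and the mean of a numeric column per category (what a barplot draws by default). The `toy` dataframe is made up for illustration.

```python
import pandas as pd

# Made-up data (not the class survey)
toy = pd.DataFrame({
    "cookie": ["chip", "chip", "raisin", "chip", "raisin"],
    "milk_servings": [2, 3, 0, 1, 2],
})

# what a countplot shows: number of rows per category
counts = toy["cookie"].value_counts()

# what a barplot shows by default: mean of a numeric column per category
means = toy.groupby("cookie")["milk_servings"].mean()

print(counts)
print(means)
```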
# generate default barplot
sns.countplot(x='lecture_attendance', 
              data=df
              # COMMENT THIS IN AND OUT (append to `data=df` to shorten the tick labels):
              # .replace({ 'I prefer to attend lecture': 'prefer to attend', 'I prefer not to attend lecture (i.e. catch up later, listen to podcast, etc.)': 'prefer to not attend'} )
             );
ax = sns.countplot(x='lecture_attendance', 
                   data=df, color = '#686868')

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS108 students prefer to attend lecture')
ax.set_ylabel('Count')
ax.set_xlabel('Lecture Attendance Preference')
# set tick labels
ax.set_xticklabels(("attend", "not attend"));
ax = sns.countplot(x='gender', data=df, color='#686868')

# add title and axis labels (modify x-axis label)
ax.set_title('There are more males than females in COGS108')
ax.set_ylabel('Count') 
ax.set_xlabel('Gender');

It's often a good idea to order axes from largest to smallest for categorical data.
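
Rather than typing the category order by hand, you can derive it from the data. A minimal sketch with pandas, using a made-up `gender` column:

```python
import pandas as pd

# Made-up responses
gender = pd.Series(["female", "male", "male", "other", "male", "female"])

# value_counts() sorts from most to least common by default
order = gender.value_counts().index.tolist()
print(order)
```

The resulting list can be passed as the `order` argument of `sns.countplot`.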

ax = sns.countplot(x='gender', data=df, color='#686868',
                   order=['male', 'female', 'other or prefer not to say'])

# add title and axis labels (modify x-axis label)
ax.set_title('Male is the most prevalent gender in COGS108.')
ax.set_ylabel('Count')
ax.set_xlabel('Gender');
# warning: not seaborn
# pandas approach
# proportion of the class familiar with each programming language
a = df.iloc[:,5:11].sum()/len(df)
a = a.sort_values(axis=0, ascending=False)
a.plot.bar(color='#686868', rot=0);

Grouped Barplots

# same color palette as defined earlier
# generate grouped barplot by specifying hue
ax = sns.countplot(x='lecture_attendance', hue='gender',
                   data=df, palette=my_palette)

# add title and axis labels (modify x-axis label)
ax.set_title('Most COGS108 students prefer to attend lecture')
ax.set_ylabel('Count')
ax.set_xlabel('Lecture Attendance Preference')
ax.set_xticklabels(('attend', 'not attend'));

Because we have different numbers of males and females, comparing counts is not all that helpful...

We need proportions.
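
One way to get from counts to proportions is `pd.crosstab` with `normalize='index'`, which divides each row by its row total. This is a sketch on made-up data with hypothetical column names, not the class survey.

```python
import pandas as pd

# Made-up data: four males and two females
toy = pd.DataFrame({
    "gender": ["male", "male", "male", "male", "female", "female"],
    "attendance": ["attend", "attend", "attend", "not attend",
                   "attend", "not attend"],
})

# raw counts per gender x attendance
counts = pd.crosstab(toy["gender"], toy["attendance"])

# proportions within each gender (each row sums to 1)
proportions = pd.crosstab(toy["gender"], toy["attendance"], normalize="index")

print(counts)
print(proportions)
```

With proportions, groups of different sizes can be compared directly.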

Stacked Barplots

# warning: this is not seaborn
df2 = df.groupby([ 'lecture_attendance','gender'])['lecture_attendance'].count().unstack('gender').fillna(0)
sub_df2 = np.transpose(df2.div(df2.sum()))

# generate plot
ax = sub_df2.plot(kind='bar', stacked=True, rot=0,
                  title='Lecture Attendance does not appear to differ by gender')

# customize plot
ax.legend(('not attend','attend'), loc='center left', bbox_to_anchor=(1.0, 0.5))
ax.set_ylabel("Proportion of students");

More plots

  • boxplots (quantitative + categorical)
  • lineplots (quantitative over time)

Boxplots

By default, the box delineates the 25th and 75th percentile. The line down the middle represents the median. "Whiskers" extend to show the range for the rest of the data, excluding outliers. Outliers are marked as individual points outside of the whiskers.

# generate boxplots
sns.boxplot(y='statistics', x='gender', data=df);

Outlier determination

Outliers show up as individual points on boxplots. But, we don't see any on this boxplot. Let's see why...

# determine the 25th and 75th percentiles
lower, upper = np.percentile(df['statistics'], [25, 75])
lower, upper
# calculate IQR
iqr = upper - lower
iqr

Typically, the inter-quartile range (IQR) is used to determine which values get marked as outliers. The IQR is: 75th percentile - 25th percentile. Values more than 1.5 x IQR above the 75th percentile or more than 1.5 x IQR below the 25th percentile are marked as outliers.

# calculate lower cutoff
# values below this are outliers 
lower_cutoff = lower - 1.5 * iqr

# calculate upper cutoff
# values above this are outliers 
upper_cutoff = upper + 1.5 * iqr

lower_cutoff, upper_cutoff
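
To see the rule flag something, here is the same calculation on a small made-up array that contains one extreme value. The numbers are chosen for illustration only.

```python
import numpy as np

# Made-up values with one obvious extreme
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# quartiles and IQR
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# 1.5 x IQR cutoffs
lower_cutoff = q25 - 1.5 * iqr
upper_cutoff = q75 + 1.5 * iqr

# keep only the values outside the cutoffs
outliers = values[(values < lower_cutoff) | (values > upper_cutoff)]

print("cutoffs:", lower_cutoff, upper_cutoff)
print("outliers:", outliers.tolist())
```

Only values beyond the cutoffs are drawn as individual points; if every value falls inside them, the boxplot shows no outliers.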

Boxplots really shine when you want to look at the range of typical values for a quantitative variable, broken down by a separate categorical variable.

# generate boxplots
# default colors first; these don't match what we used earlier for the same groups
ax = sns.boxplot(x='gender', y='statistics', data=df)

ax.set_title('Gender not related to comfort with statistics')
ax.set_ylabel('Comfort with Statistics')
ax.set_xlabel('Gender');
# generate boxplots
# we can make sure the colors match what we used earlier for the same groups
ax = sns.boxplot(x='gender', y='statistics', data=df, palette=my_palette)

ax.set_title('Gender not related to comfort with statistics')
ax.set_ylabel('Comfort with Statistics')
ax.set_xlabel('Gender');

Much better!

Histograms (by a categorical variable)

The same data plotted as a histogram are not so easily interpretable.

# `distplot` in older versions of `seaborn`
sns.histplot(df.loc[df['gender'] == 'female', 'statistics'], kde=True, color="red")
sns.histplot(df.loc[df['gender'] == 'male', 'statistics'], kde=True, color="purple")
sns.histplot(df.loc[df['gender'] == 'other or prefer not to say', 'statistics'], kde=True);

Customization: births data

Now that we're getting the hang of this, let's see how complicated things can get. We'll return to using a line chart to look at birth patterns over time.

# get the data
births = pd.read_csv('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/data/births.csv')
births.head()
births.year.max()
from datetime import datetime

# calculate values & wrangle
quartiles = np.percentile(births['births'], [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')

births['day'] = births['day'].astype(int)

births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')
births_by_date = births.pivot_table('births',
                                    [births.index.month, births.index.day])
births_by_date.index = [datetime(2012, month, day)
                        for (month, day) in births_by_date.index]


# plot the thing
fig, ax = plt.subplots(figsize=(22, 5))
births_by_date.plot(ax = ax)
ax.get_legend().remove()
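
A note on the `0.74 * (quartiles[2] - quartiles[0])` step above: for normally distributed data the interquartile range is about 1.35 standard deviations, so 0.74 x IQR is an estimate of the standard deviation that is not thrown off by a few wild values. The sketch below checks this on simulated data; the sample and the contaminating values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_sigma = 2.0
samples = rng.normal(loc=10.0, scale=true_sigma, size=100_000)

# contaminate the sample with a few wild values
contaminated = np.concatenate([samples, [1e6, -1e6, 5e5]])

# robust estimate of the standard deviation from the IQR
q25, q75 = np.percentile(contaminated, [25, 75])
robust_sigma = 0.74 * (q75 - q25)

print("plain std:   ", contaminated.std())
print("robust sigma:", robust_sigma)
```

The plain standard deviation is inflated by the three bad values, while the IQR-based estimate stays close to the true value of 2.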

What are all those dips? Well, let's annotate the plot to get a better sense of what's going on.

# plot the thing
fig, ax = plt.subplots(figsize=(22, 7))
births_by_date.plot(ax=ax)
ax.get_legend().remove();

# define style
style = dict(size=16, color='gray')

# add annotation
ax.text('2012-1-1', 3950, "New Year's Day", **style)
ax.text('2012-7-4', 4250, "Independence Day", ha='center', **style)
ax.text('2012-9-4', 4850, "Labor Day", ha='center', **style)
ax.text('2012-10-31', 4600, "Halloween", ha='right', **style)
ax.text('2012-11-25', 4450, "Thanksgiving", ha='center', **style)
ax.text('2012-12-25', 3850, "Christmas ", ha='right', **style)

# label the axes
ax.set(title='USA births by day of year (1969-1988)',
       ylabel='average daily births')

# format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%b'));  # '%b' = abbreviated month name ('%h' is not supported on every platform)

Annotation directly on plots can help explain the plot to viewers.

Saving Plots

While we're using a Jupyter notebook right now, you won't always be. So, you'll need to know how to save figures.

# save fig to the current working directory
# to save into a subdirectory instead (e.g. 'plots/my_figure.png'),
# that directory must already exist
fig.savefig('my_figure.png',dpi=300)

Note that the file format is inferred from the extension you specify in the filename.
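
As a quick check of that behavior, the sketch below saves one figure under three extensions and peeks at the first bytes of two of the files. It writes to a temporary directory so nothing lands in your project folder; the plotted values are arbitrary.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

out_dir = tempfile.mkdtemp()  # throwaway directory for this example
saved = {}
for ext in ("png", "pdf", "svg"):
    path = os.path.join(out_dir, f"my_figure.{ext}")
    fig.savefig(path)  # format chosen from the extension
    saved[ext] = path

# read the first bytes of each file to confirm the format
with open(saved["png"], "rb") as f:
    png_header = f.read(4)
with open(saved["pdf"], "rb") as f:
    pdf_header = f.read(4)

print(png_header, pdf_header)
```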

To see which file types are supported:

fig.canvas.get_supported_filetypes()

Viewing Saved Plots

Once a plot is saved, it may be helpful to view it through IPython or your notebook.

You can embed a saved image with Markdown formatting (or with HTML in a Markdown cell), or display it from a code cell as shown below.

dates figure

# to see contents of a saved image
from IPython.display import Image
Image('https://github.com/COGS108/Lectures-Sp22/raw/master/03_ethics/images/my_figure.png')